Regular Expression Parser/Matcher

Overview

These classes implement a parser/matcher for regular expressions.

Documentation

The following text is an HTML transcription of the text found in the "documentation-manual" class methods of the Regex::RxParser class. If you need further information: "use the source, Luke".

These pages will not teach you regular expression usage nor the Smalltalk language.
For regular expressions, the following excellent book is recommended:

Mastering Regular Expressions by Jeffrey E.F. Friedl, O'Reilly (http://www.oreilly.com/catalog/regex).

For Smalltalk literature, please refer to the "Reading List".

Introduction

A regular expression is a template specifying a class of strings. A regular expression matcher is a tool that determines whether a string belongs to a class specified by a regular expression. This is a common task in user input validation code, and the use of regular expressions can GREATLY simplify and speed up development of such code.
As an example, here is how to verify that a string is a valid hexadecimal number in Smalltalk notation, using this matcher package:

	aString matchesRegex: '16r[[:xdigit:]]+'

(Coding the same ``the hard way'' is left as an exercise for the curious reader).

This matcher is offered to the Smalltalk community in the hope that it will be useful. It is free in terms of money and, to a large extent, in terms of rights of use. Refer to the `Boring Stuff' section for legalese.

The 'What's new in this release' section describes the functionality introduced in the 1.1 release.
The `Syntax' section explains the recognized syntax of regular expressions.
The `Usage' section explains matcher capabilities that go beyond what the String » matchesRegex: method offers.
The `Implementation notes' section says a few words about what is under the hood.

Happy hacking,
Vassili Bykov <vassili@objectpeople.com> <vassili@magma.ca>

August 6, 1996 (first release)
April 4, 1999 (rel1.1)

What's new in Version 1.1 (Oct 1999)

Regular expression syntax corrections and enhancements:

Backslash escapes similar to those in Perl are allowed in patterns:

    \w      any word constituent character (equivalent to [a-zA-Z0-9_])
    \W      any character but a word constituent (equivalent to [^a-zA-Z0-9_])
    \d      a digit (same as [0-9])
    \D      anything but a digit
    \s      a whitespace character
    \S      anything but a whitespace character
    \b      an empty string at a word boundary
    \B      an empty string not at a word boundary
    \<      an empty string at the beginning of a word
    \>      an empty string at the end of a word

For example, '\w+' is now a valid expression matching any word.

The following backslash escapes are also allowed in character sets (between square brackets):
```
    \w, \W, \d, \D, \s, and \S.
```

The following grep(1)-compatible named character classes are recognized in character sets as well:

    [:alnum:]
    [:alpha:]
    [:blank:]
    [:cntrl:]
    [:digit:]
    [:graph:]
    [:lower:]
    [:print:]
    [:punct:]
    [:space:]
    [:upper:]
    [:xdigit:]

For example, the following patterns are equivalent:

    '[[:alnum:]]+'
    '\w+'
    '[\w]+'
    '[a-zA-Z0-9_]+'

Some non-printable characters can be represented in regular expressions using a common backslash notation:

    \t      tab (Character tab)
    \n      newline (Character lf)
    \r      carriage return (Character cr)
    \f      form feed (Character newPage)
    \e      escape (Character esc)

A dot is correctly interpreted as 'any character but a newline' instead of 'anything but whitespace'.
Case-insensitive matching is supported. The easiest way to access it is through the new messages CharacterArray understands: #asRegexIgnoringCase, #matchesRegexIgnoringCase:, and #prefixMatchesRegexIgnoringCase:.

The matcher (an instance of RxMatcher, the result of String » asRegex) now provides a collection-like interface to matches in a particular string or on a particular stream, as well as a substitution protocol. The interface includes the following messages:

    matchesIn: aString
    matchesIn: aString collect: aBlock
    matchesIn: aString do: aBlock

    matchesOnStream: aStream
    matchesOnStream: aStream collect: aBlock
    matchesOnStream: aStream do: aBlock

    copy: aString translatingMatchesUsing: aBlock
    copy: aString replacingMatchesWith: replacementString

    copyStream: aStream to: writeStream translatingMatchesUsing: aBlock
    copyStream: aStream to: writeStream replacingMatchesWith: aString

Examples:

    '\w+' asRegex matchesIn: 'now is the time'

returns an OrderedCollection containing four strings: 'now', 'is', 'the', and 'time'.

    '\<t\w+' asRegexIgnoringCase
	    copy: 'now is the Time'
	    translatingMatchesUsing: [:match | match asUppercase]

returns 'now is THE TIME' (the regular expression matches words beginning with either an uppercase or a lowercase T).

Syntax

[You can select and `print it' the examples in this text.]

Exact Character Match

The simplest regular expression is a single character. It matches exactly that character. A sequence of characters matches a string with exactly the same sequence of characters:

    'a' matchesRegex: 'a'                   "-> true"

    'foobar' matchesRegex: 'foobar'         "-> true"

    'blorple' matchesRegex: 'foobar'        "-> false"

The above paragraph introduced a primitive regular expression (a character), and an operator (sequencing). Operators are applied to regular expressions to produce more complex regular expressions. Sequencing (placing expressions one after another) as an operator is, in a certain sense, `invisible'--yet it is arguably the most common.

Any Character Match ( . )

The special `any' character "." (dot) matches ANY character EXCEPT newline.
Thus

    'abc' matchesRegex: 'a..'               "-> true"

    'abcd' matchesRegex: 'a..'              "-> false"

In fact, 'a..' matches any 3-character string beginning with `a', except those that include a newline character.

Repeated Matches ( * for ZERO or more occurrences)

A more `visible' operator is Kleene closure, more often simply referred to as `a star'. A regular expression followed by an asterisk (`*') matches any number (including 0) of matches of the original expression.
For example:

    'ab' matchesRegex: 'a*b'                "-> true"

    'aaaaab' matchesRegex: 'a*b'            "-> true"

    'b' matchesRegex: 'a*b'                 "-> true"

    'aac' matchesRegex: 'a*b'               "-> false: b does not match"

    '123aa' matchesRegex: '.*aa'            "-> true (matches any string which ends with 'aa', but not containing a newline)"

    '123aa456' matchesRegex: '.*aa.*'       "-> true (matches any string containing 'aa', but not containing a newline)"

A star's precedence is higher than that of sequencing. A star applies to the shortest possible subexpression that precedes it. For example, 'ab*' means `a followed by zero or more occurrences of b', not `zero or more occurrences of ab':

    'abbb' matchesRegex: 'ab*'              "-> true"

    'abab' matchesRegex: 'ab*'              "-> false"

Parentheses for Grouping

To actually make a regex matching `zero or more occurrences of ab', `ab' is enclosed in parentheses:

    'abab' matchesRegex: '(ab)*'            "-> true"

    'abcab' matchesRegex: '(ab)*'           "-> false: c spoils the fun"

Repeated Matches ( + for ONE or more occurrences, ? for ZERO or ONE occurrence)

Two other operators similar to `*' are `+' and `?'.

`+' (positive closure, or simply `plus') matches one or more occurrences of the original expression (i.e. at least one).
`?' (`optional') matches zero or one, but never more, occurrences.

For example:

    'ac' matchesRegex: 'ab*c'               "-> true"

    'ac' matchesRegex: 'ab+c'               "-> false: need at least one b"

    'abbc' matchesRegex: 'ab+c'             "-> true"

    'abbc' matchesRegex: 'ab?c'             "-> false: too many b's"

    'ac' matchesRegex: 'ab?c'               "-> true: the b is optional"

Escaping Special Characters

As we have seen, the characters `*', `+', `?', `(', and `)' have special meaning in regular expressions. If one of them is to be used literally, it should be quoted, that is, preceded with a backslash. (Thus, backslash is also a special character and needs to be quoted for a literal match, as does any other special character described further on).

    'ab*' matchesRegex: 'ab*'               "-> false: star in the right string is special"

    'ab*' matchesRegex: 'ab\*'              "-> true"

    'a\c' matchesRegex: 'a\\c'              "-> true"

Alternative Match Patterns ( | )

The last operator is `|' meaning `or'.
It is placed between two regular expressions, and the resulting expression matches if one of the expressions matches. It has the lowest possible precedence (lower than sequencing). For example, `ab*|ba*' means `a followed by any number of b's, or b followed by any number of a's':

    'abb' matchesRegex: 'ab*|ba*'           "-> true"

    'baa' matchesRegex: 'ab*|ba*'           "-> true"

    'baab' matchesRegex: 'ab*|ba*'          "-> false"

A bit more complex example is the following expression, matching the name of any of the Lisp-style `car', `cdr', `caar', `cadr', ... functions:

    c(a|d)+r
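
As a quick check, the pattern above can be tried with the same #matchesRegex: message used throughout this section. The sample strings are illustrative only, and each line can be evaluated on its own:

```smalltalk
"One or more a's or d's are required between c and r"
'car' matchesRegex: 'c(a|d)+r'.         "-> true"
'cdr' matchesRegex: 'c(a|d)+r'.         "-> true"
'cadadr' matchesRegex: 'c(a|d)+r'.      "-> true"
'cr' matchesRegex: 'c(a|d)+r'.          "-> false: plus needs at least one a or d"
'caxr' matchesRegex: 'c(a|d)+r'.        "-> false: x is neither a nor d"
```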

It is possible to write an expression matching an empty string, for example: `a|'. However, it is an error to apply `*', `+', or `?' to such an expression: `(a|)*' is an invalid expression.

Character Sets ( [ ... ] )

So far, we have used only characters as the 'smallest' components of regular expressions. There are other, more `interesting', components.

A character set is a string of characters enclosed in square brackets. It matches any single character that appears between the brackets.
For example, `[01]' matches either `0' or `1':

    '0' matchesRegex: '[01]'         "-> true"

    '3' matchesRegex: '[01]'         "-> false"

    '11' matchesRegex: '[01]'        "-> false: a set matches only one character"

Using the plus operator, we can build the following binary number recognizer:

    '10010100' matchesRegex: '[01]+'        "-> true"

    '10001210' matchesRegex: '[01]+'        "-> false"

Inverted Character Set ( [ ^... ] )

If the first character after the opening bracket is `^', the set is inverted: it matches any single character *not* appearing between the brackets:

    '0' matchesRegex: '[^01]'               "-> false"

    '3' matchesRegex: '[^01]'               "-> true"

Character Ranges in a Set ( [ x-y ] )

For convenience, a set may include ranges: pairs of characters separated with `-'. This is equivalent to listing all characters between them: `[0-9]' is the same as `[0123456789]'.

Special characters within a set are `^', `-', and `]' that closes the set.
Below are the examples of how to literally use them in a set:

    [01^]           -- put the caret anywhere except the beginning
    [01-]           -- put the dash as the last character
    []01]           -- put the closing bracket as the first character
    [^]01]             (thus, empty and universal sets cannot be specified)
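
Following the placement rules above, here is a small sketch of matching the three special characters literally. The test strings are illustrative, and the results shown are what those rules imply:

```smalltalk
'^' matchesRegex: '[01^]'.      "-> true: the caret is not the first character"
'-' matchesRegex: '[01-]'.      "-> true: the dash is the last character"
']' matchesRegex: '[]01]'.      "-> true: the closing bracket is the first character"
']' matchesRegex: '[^]01]'.     "-> false: the inverted set excludes ], 0 and 1"
'x' matchesRegex: '[^]01]'.     "-> true"
```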

Special Characters in a Set

Be careful: `.' and similar special characters are no longer special inside the character set;
therefore:

    '1' matchesRegex: '[1.]'         "-> true"

and:

    '.' matchesRegex: '[1.]'         "-> true"

but not:

    '2' matchesRegex: '[1.]'         "-> false"

Common Character Classes

Regular expressions can also include the following backslash escapes to refer to popular classes of characters:

    \w      any word constituent character (same as [a-zA-Z0-9_])
    \W      any character but a word constituent
    \d      a digit (same as [0-9])
    \D      anything but a digit
    \s      a whitespace character
    \S      anything but a whitespace character

These escapes are also allowed in character classes: '[\w+-]' means 'any character that is either a word constituent, or a plus, or a minus'.

Character classes can also include the following grep(1)-compatible elements to refer to common classes of characters:

    [:alnum:]               any alphanumeric, i.e., a word constituent, character
    [:alpha:]               any alphabetic character
    [:blank:]               space or tab.
    [:cntrl:]               any control character.
                            In this version, it means any character whose ASCII code is < 32.
    [:digit:]               any decimal digit.
    [:graph:]               any graphical character.
                            In this version, this means any character with ASCII code >= 32.
    [:lower:]               any lowercase character
    [:print:]               any printable character.
                            In this version, this is the same as [:graph:]
    [:punct:]               any punctuation character.
    [:space:]               any whitespace character.
    [:upper:]               any uppercase character.
    [:xdigit:]              any hexadecimal character.

Note that these elements are components of the character classes, i.e. they have to be enclosed in an extra set of square brackets to form a valid regular expression.
For example, a non-empty string of digits would be represented as '[[:digit:]]+'.
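
To illustrate the extra set of brackets, here is a short sketch using two of the named classes listed above. The input strings are made up for the example:

```smalltalk
"The outer brackets form the set; [:digit:] is an element inside it"
'2024' matchesRegex: '[[:digit:]]+'.            "-> true"
'20x4' matchesRegex: '[[:digit:]]+'.            "-> false: x is not a digit"
'16rFF' matchesRegex: '16r[[:xdigit:]]+'.       "-> true"
'16rXY' matchesRegex: '16r[[:xdigit:]]+'.       "-> false: X and Y are not hex digits"
```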

Smalltalk Specific: Character Test Messages

The above primitive expressions and operators are common to many implementations of regular expressions. The next primitive expression is unique to this Smalltalk implementation.

A sequence of characters between colons is treated as a unary selector which is supposed to be understood by characters. A character matches such an expression if it answers true to a message with that selector. This allows a more readable and efficient way of specifying character classes (it can also be easily extended by adding appropriate protocol to the Character class).
For example, `[0-9]' is equivalent to `:isDigit:', but the latter is more efficient. Analogously to character sets, character classes can be negated: `:^isDigit:' matches a Character that answers false to #isDigit, and is therefore equivalent to `[^0-9]'.

The following messages from Smalltalk's Character protocol are useful here:

    :isControlCharacter:  true if I am a control character (i.e. ascii value < 32 or == 16rFF)
    :isDigit:             as described above
    :isLetter:            a-z or A-Z
    :isLetterOrDigit:     a-z or A-Z or 0-9
    :isNationalLetter:    any letter in the whole Unicode set (not just a-z, A-Z)
    :isNationalAlphaNumeric:  any letter or digit from the Unicode set
    :isLowercase:         any lowercase letter in the Unicode set (i.e. not only a-z)
    :isUppercase:         any uppercase letter in the Unicode set (i.e. not only A-Z)
    :isSeparator:         any whitespace (space, nl, cr, tab, ff)
    :isVowel:             aeiouAEIOU
    :isHexDigit:          0-9, a-f, A-F
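
A minimal sketch of the selector syntax in use, relying only on selectors from the list above. The sample strings are illustrative:

```smalltalk
'123' matchesRegex: ':isDigit:+'.       "-> true"
'12a' matchesRegex: ':isDigit:+'.       "-> false: $a answers false to #isDigit"
'abc' matchesRegex: ':^isDigit:+'.      "-> true: negated test, no character is a digit"
'aeiou' matchesRegex: ':isVowel:+'.     "-> true"
```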

As a summarizing example, so far we have seen the following equivalent ways to write a regular expression that matches a non-empty string of digits:

    '[0-9]+'
    '\d+'
    '[\d]+'
    '[[:digit:]]+'
    ':isDigit:+'

More Special Characters

The last group of special primitive expressions includes:

    .       matching any character except a newline;
    ^       matching an empty string at the beginning of a line;
    $       matching an empty string at the end of a line.
    \b      an empty string at a word boundary
    \B      an empty string not at a word boundary
    \<      an empty string at the beginning of a word
    \>      an empty string at the end of a word

Again, the characters `.', `^' and `$' are special and should be quoted to be matched literally.

Examples:

    'axyzb' matchesRegex: 'a.+b'            "-> true"

    'ax zb' matchesRegex: 'a.+b'            "-> true (space is matched by `.')"

    ('ax' , Character cr ,'zb')
	matchesRegex: 'a.+b'                "-> false (newline is not matched by `.')"

    ('ax' , Character cr ,'zb')
        matchesRegex: 'a(.|\n)+b'           "-> true"

EXAMPLES

As the introduction said, a great use for regular expressions is user input validation. Following are a few examples of regular expressions that might be handy in checking input entered by the user in an input field. Try them out by entering something between the quotes and using `print it'. (Also, try to imagine the Smalltalk code that each validation would require if coded by hand). Most example expressions could have been written in alternative ways.

Checking if aString may represent a nonnegative integer number:

    aString matchesRegex: ':isDigit:+'

    aString matchesRegex: '[0-9]+'

    aString matchesRegex: '\d+'

Checking if aString may represent an integer number with an optional sign in front:

    aString matchesRegex: '(\+|-)?\d+'

Checking if aString is a fixed-point number, with at least one digit required after the dot:

    aString matchesRegex: '(\+|-)?\d+(\.\d+)?'

The same, but allow notation like `123.':

    aString matchesRegex: '(\+|-)?\d+(\.\d*)?'

Recognizer for a string that might be a name: one word with first capital letter, no blanks, no digits. More traditional:

    aString matchesRegex: '[A-Z][A-Za-z]*'

more Smalltalkish:

    aString matchesRegex: ':isUppercase::isAlphabetic:*'

A date in the format MMM DD, YYYY with any number of spaces in between, in the 20th century:

    aString matchesRegex: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)'

Note the parentheses around some components of the expression above. As the `Usage' section shows, they will allow us to obtain the actual strings that have matched them (i.e. month name, day number, and year number).

For dessert, coming back to numbers: here is a recognizer for a general number format: anything like 999, or 999.999, or -999.999e+21.

    aString matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'
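
As a sketch of how the recognizer behaves, the snippet below runs it over a handful of sample inputs. The variable name numberPattern and the inputs are chosen for illustration only:

```smalltalk
| numberPattern |
numberPattern := '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'.

"Answers one Boolean per input, in the same order"
#('999' '999.999' '-999.999e+21' '1e' '.5')
    collect: [:each | each matchesRegex: numberPattern]
"-> true, true, true, false, false"
```

The last two inputs fail because an exponent marker must be followed by at least one digit, and because at least one digit is required before the dot.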

Usage

The preceding section covered the syntax of regular expressions. It used the simplest possible interface to the matcher: sending a #matchesRegex: message to the sample string, with a regular expression string as the argument.
This section explains hairier ways of using the matcher.

Prefix Matching and Case-Insensitive Matching

A CharacterArray also understands these messages:

    aString prefixMatchesRegex: regexString
    aString matchesRegexIgnoringCase: regexString
    aString prefixMatchesRegexIgnoringCase: regexString

#prefixMatchesRegex: is just like #matchesRegex:, except that the whole receiver is not expected to match the regular expression passed as the argument; matching just a prefix of it is enough.
For example:

    'abcde' matchesRegex: '(a|b)+'          "-> false"

    'abcde' prefixMatchesRegex: '(a|b)+'    "-> true"

The last two messages are case-insensitive versions of matching.
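
For instance, the case-insensitive variants could be compared with their case-sensitive counterparts as follows (the sample strings are illustrative):

```smalltalk
'HELLO' matchesRegex: 'hello'.                              "-> false: matching is case-sensitive"
'HELLO' matchesRegexIgnoringCase: 'hello'.                  "-> true"
'Hello, world' prefixMatchesRegex: 'hello'.                 "-> false"
'Hello, world' prefixMatchesRegexIgnoringCase: 'hello'.     "-> true: the prefix 'Hello' matches"
```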

ENUMERATION INTERFACE

An application can be interested in all matches of a certain regular expression within a String. The matches are accessible using a protocol modelled after the familiar Collection-like enumeration protocol:

    aString regex: regexString matchesDo: aBlock

Evaluates a one-argument <aBlock> for every match of the regular expression within the receiver string.

    aString regex: regexString matchesCollect: aBlock

Evaluates a one-argument <aBlock> for every match of the regular expression within the receiver string. Collects results of evaluations and answers them as a SequenceableCollection.

    aString allRegexMatches: regexString

Returns a collection of all matches (substrings of the receiver string) of the regular expression.
It is equivalent to

    aString regex: regexString matchesCollect: [:each | each].
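
The three enumeration messages can be sketched side by side on one sample sentence. The variable names lengths and count are illustrative:

```smalltalk
| lengths count |

'now is the time' allRegexMatches: '\w+'.
    "-> a collection with 'now', 'is', 'the' and 'time'"

lengths := 'now is the time' regex: '\w+' matchesCollect: [:each | each size].
    "lengths now holds 3, 2, 3 and 4"

count := 0.
'now is the time' regex: '\w+' matchesDo: [:each | count := count + 1].
count    "-> 4"
```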

REPLACEMENT AND TRANSLATION

It is possible to replace all matches of a regular expression with a certain string using the message:

    aString copyWithRegex: regexString matchesReplacedWith: replacementString

For example:

    'ab cd ab' copyWithRegex: '(a|b)+' matchesReplacedWith: 'foo'

returns the string: 'foo cd foo'.

A more general substitution is match translation:

    aString copyWithRegex: regexString matchesTranslatedUsing: aBlock

This message evaluates a block passing it each match of the regular expression in the receiver string and answers a copy of the receiver with the block results spliced into it in place of the respective matches.
For example:

    'ab cd ab' copyWithRegex: '(a|b)+' matchesTranslatedUsing: [:each | each asUppercase]

results in the string: 'AB cd AB'.

All messages of enumeration and replacement protocols perform a case-sensitive match. Case-insensitive versions are not provided as part of a CharacterArray protocol. Instead, they are accessible using the lower-level matching interface.

LOWER-LEVEL INTERFACE

Internally, aString matchesRegex: works as follows:

A fresh instance of RxParser is created, and the regular expression string is passed to it, yielding the expression's syntax tree.
The syntax tree is passed as an initialization parameter to an instance of RxMatcher. The instance sets up some data structure that will work as a recognizer for the regular expression described by the tree.
The original string is passed to the matcher, and the matcher checks for a match.

THE MATCHER

If you repeatedly match a number of strings against the same regular expression using one of the messages defined in CharacterArray, the regular expression string is parsed and a matcher is created anew for every match. You can avoid this overhead by building a matcher for the regular expression, and then reusing the matcher over and over again. You can, for example, create a matcher at a class or instance initialization stage, and store it in a variable for future use.

You can create a matcher using one of the following methods:

Sending a forString:ignoreCase: message to the RxMatcher class, with the regular expression string and a Boolean indicating whether case is ignored as arguments.
Sending a forString: message to the RxMatcher class. It is equivalent to "... forString: regexString ignoreCase: false".

A more convenient way is to use one of the two matcher-creating messages understood by CharacterArray:

"regexString asRegex" is equivalent to "RxMatcher forString: regexString".
"regexString asRegexIgnoringCase" is equivalent to "RxMatcher forString: regexString ignoreCase: true".

Here are four examples of creating a matcher:

    hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+'
    hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+' ignoreCase: false
    hexRecognizer := '16r[0-9A-Fa-f]+' asRegex
    hexRecognizer := '16r[0-9A-F]+' asRegexIgnoringCase
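
The benefit of reuse can be sketched as follows: the expression is parsed once, and the resulting matcher is then applied to several strings with the #matches: message described in the next section. The candidates array is made up for the example:

```smalltalk
| hexRecognizer candidates |
hexRecognizer := '16r[0-9A-Fa-f]+' asRegex.     "parsed and compiled once"
candidates := #('16rFF' '16r' 'FF' '16r1a2B').

"The same matcher instance is reused for every element"
candidates select: [:each | hexRecognizer matches: each]
"-> only '16rFF' and '16r1a2B' are selected"
```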

MATCHING

The matcher understands these messages (all of them return true to indicate successful match or search, and false otherwise):

matches: aString: True if the whole target string (aString) matches.
matchesPrefix: aString: True if some prefix of the string (not necessarily the whole string) matches.
search: aString: Search the string for the first occurrence of a matching substring. (Note that the first two methods only try matching from the very beginning of the string). For example, a matcher for `a+' would answer true to this message given the string `baaa', while the previous two would answer false.
matchesStream: aStream
matchesStreamPrefix: aStream
searchStream: aStream: Respective analogs of the first three methods, taking input from a stream instead of a string. The stream must be positionable and peekable.

The matcher also stores the outcome of the last match attempt and can report it:

lastResult: Answers a Boolean -- the outcome of the most recent match attempt. If no matches were attempted, the answer is unspecified.
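
The difference between the string-matching messages can be sketched with the `a+' matcher mentioned above. The second test string is illustrative:

```smalltalk
| matcher |
matcher := 'a+' asRegex.

matcher matches: 'baaa'.            "-> false: the whole string does not match"
matcher matchesPrefix: 'baaa'.      "-> false: no match starts at the first character"
matcher search: 'baaa'.             "-> true: 'aaa' is found inside the string"
matcher lastResult.                 "-> true: the outcome of the search above"
matcher matchesPrefix: 'aab'.       "-> true: the prefix 'aa' matches"
```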

SUBEXPRESSION MATCHES

After a successful match attempt, you can query the specifics of which part of the original string has matched which part of the whole expression. A subexpression is a parenthesized part of a regular expression, or the whole expression. When a regular expression is compiled, its subexpressions are assigned indices starting from 1, depth-first, left-to-right.
For example, `((ab)+(c|d))?ef' includes the following subexpressions with these indices:

	  1:      ((ab)+(c|d))?ef
	  2:      (ab)+(c|d)
	  3:      ab
	  4:      c|d

Be aware that the first subexpression represents the whole match.
After a successful match, the matcher can report what part of the original string matched what subexpression. It understands these messages:

subexpressionCount: Answers the total number of subexpressions: the highest value that can be used as a subexpression index with this matcher. This value is available immediately after initialization and never changes.
subexpression: anIndex: The index must be a valid subexpression index, and this message must be sent only after a successful match attempt. The method answers the substring of the original string that the corresponding subexpression has matched.
subBeginning: anIndex
subEnd: anIndex: Answer positions within the original string or stream where the match of a subexpression with the given index has started and ended, respectively.

This facility provides a convenient way of extracting parts of input strings of complex format. For example, the following piece of code uses the 'MMM DD, YYYY' date format recognizer example from the `Syntax' section to convert a date to a three-element array with year, month, and day strings (you can select and evaluate it right here):

    | matcher |
    matcher := Regex::RxMatcher new initializeFromString: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(:isDigit::isDigit:?)[ ]*,[ ]*19(:isDigit::isDigit:)'.
    (matcher matches: 'Aug 6, 1996')
	    ifTrue:
		    [Array
			    with: (matcher subexpression: 4)
			    with: (matcher subexpression: 2)
			    with: (matcher subexpression: 3)]
	    ifFalse: ['no match']

(should answer `#('96' 'Aug' '6')').

ENUMERATION AND REPLACEMENT

The enumeration and replacement protocols exposed in CharacterArray are actually implemented by the matcher.
The following messages are understood:

    matchesIn: aString
    matchesIn: aString do: aBlock
    matchesIn: aString collect: aBlock
    copy: aString replacingMatchesWith: replacementString
    copy: aString translatingMatchesUsing: aBlock

    matchesOnStream: aStream
    matchesOnStream: aStream do: aBlock
    matchesOnStream: aStream collect: aBlock
    copyStream: sourceStream to: targetStream replacingMatchesWith: replacementString
    copyStream: sourceStream to: targetStream translatingMatchesUsing: aBlock
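
Since the CharacterArray convenience messages are case-sensitive only, a case-insensitive replacement goes through a matcher. A minimal sketch using the string-based messages listed above, with an illustrative pattern and input:

```smalltalk
| matcher |
matcher := 'ab+' asRegexIgnoringCase.

matcher matchesIn: 'AB cd abb'.
    "-> a collection with 'AB' and 'abb'"

matcher copy: 'AB cd abb' replacingMatchesWith: 'X'.
    "-> 'X cd X'"

matcher copy: 'AB cd abb' translatingMatchesUsing: [:each | each asLowercase].
    "-> 'ab cd abb'"
```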

ERROR HANDLING

Exception signaling objects are accessible through RxParser class protocol. To handle possible errors, use the protocol described below to obtain the exception objects and use the protocol of the native Smalltalk implementation to handle them.

If a syntax error is detected while parsing expression, RxParser » syntaxErrorSignal is raised/signaled.

If an error is detected while building a matcher, RxParser » compilationErrorSignal is raised/signaled.

If an error is detected while matching (for example, if a bad selector was specified using `:<selector>:' syntax, or because of the matcher's internal error), RxParser » matchErrorSignal is raised/signaled.

RxParser » regexErrorSignal is the parent of all three. Since any of the three signals can be raised within a call to #matchesRegex:, it is handy if you want to catch them all.

For example:

ANSI Smalltalk (VisualWorks, SmalltalkX, Squeak etc.):

    [ 'abc' matchesRegex: '))garbage[' ]
	on: RxParser regexErrorSignal
	do: [:ex | ex returnWith: nil]

VisualWorks, SmalltalkX:

    RxParser regexErrorSignal
	handle: [:ex | ex returnWith: nil]
	do: [ 'abc' matchesRegex: '))garbage[' ]

VisualAge, SmalltalkX:

    [ 'abc' matchesRegex: '))garbage[' ]
	when: RxParser regexErrorSignal
	do: [:signal | signal exitWith: nil]

Implementation

Version: 1.1
Mail to: Vassili Bykov <vassili@magma.ca>, <vassili@objectpeople.com>
Flames to: /dev/null

WHAT IS ADDED

The matcher includes classes in two categories:

    VB-Regex-Syntax
    VB-Regex-Matcher

and a few CharacterArray methods in `VB-regex' protocol. No system classes or methods are modified.

WHAT TO LOOK AT FIRST

String » matchesRegex:: in 90% of cases this method is all you need to access the package.
RxParser: accepts a string or a stream of characters with a regular expression, and produces a syntax tree corresponding to the expression. The tree is made of instances of Rxs<whatever> classes.
RxMatcher: accepts a syntax tree of a regular expression built by the parser and compiles it into a matcher: a structure made of instances of Rxm<whatever> classes. The RxMatcher instance can test whether a string or a positionable stream of characters matches the original regular expression, or search a string or a stream for substrings matching the expression. After a match is found, the matcher can report a specific string that matched the whole expression, or any parenthesized subexpression of it.

All other classes support the above functionality and are used by RxParser, RxMatcher, or both.

CAVEATS

The matcher is similar in spirit, but NOT in the design--let alone the code--to Henry Spencer's original regular expression implementation in C. The focus is on simplicity, not on efficiency. I didn't optimize or profile anything. I may in the future--or I may not: I do this in my spare time and I don't promise anything.

The matcher passes H. Spencer's test suite (see 'test suite' protocol), with quite a few extra tests added, so chances are good there are not too many bugs. But watch out anyway.

EXTENSIONS, FUTURE, ETC.

With the existing separation between the parser, the syntax tree, and the matcher, it is easy to extend the system with other matchers based on other algorithms. In fact, I have a DFA-based matcher right now, but I don't feel it is good enough to include it here. I might add automata-based matchers later, but again I don't promise anything.

HOW TO REACH ME

As of today (October 3, 1999), you can contact me at <vassili@objectpeople.com>. If this doesn't work, look around comp.lang.smalltalk and comp.lang.lisp.

Boring Stuff

This license applies to the package as a whole, as well as to any component of it. By performing any of the activities described below, you accept the terms of this agreement.
The software is provided free of charge, and ``as is'', in hope that it will be useful, with ABSOLUTELY NO WARRANTY. The entire risk and all responsibility for the use of the software is with you. Under no circumstances the author may be held responsible for loss of data, loss of profit, or any other damage resulting directly or indirectly from the use of the software, even if the damage is caused by defects in the software.
You may use this software in any applications you build.
You may distribute this software provided that the software documentation and copyright notices are included and intact.
You may create and distribute modified versions of the software, such as ports to other Smalltalk dialects or derived work, provided that:

a. any modified version is expressly marked as such and is not misrepresented as the original software;
b. credit is given to the original software in the source code and documentation of the derived work;
c. the copyright notice at the top of this document accompanies copyright notices of any modified version.

ACKNOWLEDGEMENTS

Since the first release of the matcher, thanks to the input from several fellow Smalltalkers, I became convinced a native Smalltalk regular expression matcher was worth the effort to keep it alive. For the advice and encouragement that made this release possible, I want to thank:

    Felix Hack
    Eliot Miranda
    Robb Shecter
    David N. Smith
    Francis Wolinski

and anyone whom I haven't yet met or heard from, but who agrees this has not been a complete waste of time.

More Examples

Via string protocol:

    'hello world' matchesRegex: 'h.*d'

Or:

    |matcher|

    matcher := '.*ll.*' asRegex.
    matcher matches: 'hello world'.

Fetching matched subexpressions:

    |matcher sub1 sub2 sub3|

    matcher := '\D*([0-9]+)\s([0-9]+)\D*.*' asRegex.
    (matcher matches: 'bla bla 123456 123 bla bla') ifTrue:[
	Transcript showCR:(matcher subexpressionCount printString , ' subExpressions').
	sub1 := matcher subexpression:1.
	sub2 := matcher subexpression:2.
	sub3 := matcher subexpression:3.
	Transcript showCR:'subExpr1 is ' , sub1.
	Transcript showCR:'subExpr2 is ' , sub2.
	Transcript showCR:'subExpr3 is ' , sub3.
    ].

Licensing

This addOn package is NOT to be considered part of the base ST/X system. It is provided physically with the ST/X delivery, but only for your convenience.

Legally, it is a freeware or public domain goody, as specified in the goodies copyright notice (see the goodies source).

No Warranty

This goody is provided AS-IS without any warranty whatsoever.

Origin/Authors

Found in and ported from the Smalltalk archives.
Author:

Vassili Bykov

See RxParser class » boringStuff for legal information.

<info@exept.de>

Doc $Revision: 1.16 $